Combining Words and Compound Terms for Monolingual and Cross-Language Information Retrieval

Authors

  • Jian-Yun Nie
  • Jean-François Dufort
Abstract

Most existing information retrieval (IR) systems use single words as index terms to represent the contents of documents and queries. One consequence is a low level of precision. In this paper, we propose to integrate compound terms as additional indexing units, because terms are more precise representation units than words. Terms are recognized using a terminology database and an automatic term extraction tool based on syntactic templates and statistical analysis. We first show that the use of compound terms is greatly beneficial to monolingual IR. Compound terms are then incorporated into statistical translation models trained on a large set of parallel texts. Our experiments on cross-language information retrieval (CLIR) show that such a translation model achieves much better CLIR effectiveness when compound terms are integrated.
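
The idea of indexing with both single words and compound terms can be illustrated with a minimal sketch. It covers only the terminology-database lookup; the automatic term extraction tool (syntactic templates plus statistical analysis) is not modeled. The `TERMINOLOGY` set, the function names, and the `max_len` parameter are hypothetical and chosen for illustration; they are not taken from the paper.

```python
import re
from collections import Counter

# Hypothetical terminology database: a set of known compound terms (lowercased).
TERMINOLOGY = {
    "information retrieval",
    "statistical translation model",
    "parallel texts",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def index_units(text, terminology=TERMINOLOGY, max_len=4):
    """Count indexing units: every single word, plus every word n-gram
    (2..max_len words) that appears in the terminology set."""
    words = tokenize(text)
    units = Counter(words)  # single words are always kept as index units
    for n in range(2, max_len + 1):
        for i in range(len(words) - n + 1):
            candidate = " ".join(words[i:i + n])
            if candidate in terminology:
                units[candidate] += 1  # compound term added as an extra unit
    return units


units = index_units(
    "Statistical translation models help cross-language information retrieval."
)
```

Here the compound term "information retrieval" is indexed alongside its component words "information" and "retrieval", so a query containing the term can match on the more precise unit while still matching on the words. The exact-match lookup does not handle inflection, which is why "statistical translation models" is not recognized in this sketch.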

Similar resources

Japanese/English Cross-Language Information Retrieval: Exploration of Query Translation and Transliteration

Cross-language information retrieval (CLIR), where queries and documents are in different languages, has recently become one of the major topics within the information retrieval community. This paper proposes a Japanese/English CLIR system in which we combine query translation and retrieval modules. We currently target the retrieval of technical documents, and therefore the performance of our sy...

Combination Approaches in Information Retrieval: Words vs. N-grams and Query Translation vs. Document Translation

This paper reports our proposal and experimental results at the NTCIR-4 CLIR task. For monolingual information retrieval, we use a combination strategy that integrates words and n-grams at the ranked list level. In combining words and n-grams, we concentrate on generating several ranked lists showing different retrieval characteristics on word and n-gram indexes by incorporating feedback scheme...

A Language-Independent Approach to European Text Retrieval

We present an approach to multilingual information retrieval that does not depend on the existence of specific linguistic resources such as stemmers or thesauri. Using the HAIRCUT system, we participated in the monolingual, bilingual, and multilingual tasks of the CLEF-2000 evaluation. Our method, based on combining the benefits of words and character n-grams, was effective for both language-in...

Cross-Language Information Retrieval for Technical Documents

This paper proposes a Japanese/English cross-language information retrieval (CLIR) system targeting technical documents. Our system first translates a given query containing technical terms into the target language, and then retrieves documents relevant to the translated query. The translation of technical terms is still problematic in that technical terms are often compound words, and thus new te...

Semantic annotation for concept-based cross-language medical information retrieval

We present a framework for concept-based cross-language information retrieval in the medical domain, which is under development in the MUCHMORE project. Our approach is based on using the Unified Medical Language System (UMLS) as the primary source of semantic data. Documents and queries are annotated with multiple layers of linguistic information. Linguistic processing includes part-of-speech ...

Publication date: 2002